基于 GPT 文本嵌入与 CLIP 图像嵌入的多模态维基百科文章检索¶
本笔记本将展示如何使用 LlamaIndex 构建多模态检索系统。
维基百科文本嵌入索引:通过 OpenAI 的 GPT 模型生成文本嵌入
维基百科图像嵌入索引:使用 OpenAI 的 CLIP 模型生成图像嵌入
查询编码器:
- 使用 GPT 嵌入对文本索引的查询文本进行编码
- 使用 CLIP 嵌入对图像索引的查询文本进行编码
框架:LlamaIndex
实施步骤:
- 下载维基百科文章的原始文本和图像文件
- 使用 GPT 嵌入构建文本向量存储索引
- 使用 CLIP 嵌入构建图像向量存储索引
- 通过不同的查询编码嵌入和向量存储,同时检索相关文本与图像
%pip install llama-index-vector-stores-qdrant
%pip install llama_index ftfy regex tqdm
%pip install git+https://github.com/openai/CLIP.git
%pip install torch torchvision
%pip install matplotlib scikit-image
%pip install -U qdrant_client
加载并下载包含维基百科文本与图像的多模态数据集¶
解析维基百科文章并保存至本地文件夹
from pathlib import Path
import requests
wiki_titles = [
"batman",
"Vincent van Gogh",
"San Francisco",
"iPhone",
"Tesla Model S",
"BTS",
]
data_path = Path("data_wiki")
for title in wiki_titles:
response = requests.get(
"https://en.wikipedia.org/w/api.php",
params={
"action": "query",
"format": "json",
"titles": title,
"prop": "extracts",
"explaintext": True,
},
).json()
page = next(iter(response["query"]["pages"].values()))
wiki_text = page["extract"]
if not data_path.exists():
Path.mkdir(data_path)
with open(data_path / f"{title}.txt", "w") as fp:
fp.write(wiki_text)
解析维基百科图片与文本并加载至本地文件夹¶
import wikipedia
import urllib.request
image_path = Path("data_wiki")
image_uuid = 0
# image_metadata_dict stores images metadata including image uuid, filename and path
image_metadata_dict = {}
MAX_IMAGES_PER_WIKI = 30
wiki_titles = [
"San Francisco",
"Batman",
"Vincent van Gogh",
"iPhone",
"Tesla Model S",
"BTS band",
]
# create folder for images only
if not image_path.exists():
Path.mkdir(image_path)
# Download images for wiki pages
# Assing UUID for each image
for title in wiki_titles:
images_per_wiki = 0
print(title)
try:
page_py = wikipedia.page(title)
list_img_urls = page_py.images
for url in list_img_urls:
if url.endswith(".jpg") or url.endswith(".png"):
image_uuid += 1
image_file_name = title + "_" + url.split("/")[-1]
# img_path could be s3 path pointing to the raw image file in the future
image_metadata_dict[image_uuid] = {
"filename": image_file_name,
"img_path": "./" + str(image_path / f"{image_uuid}.jpg"),
}
urllib.request.urlretrieve(
url, image_path / f"{image_uuid}.jpg"
)
images_per_wiki += 1
# Limit the number of images downloaded per wiki page to 15
if images_per_wiki > MAX_IMAGES_PER_WIKI:
break
except:
print(str(Exception("No images found for Wikipedia page: ")) + title)
continue
import os
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
构建基于文本与图像嵌入的多模态向量存储(分集合管理)¶
import qdrant_client
from llama_index.core import SimpleDirectoryReader
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_db")
text_store = QdrantVectorStore(
client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
client=client, collection_name="image_collection"
)
storage_context = StorageContext.from_defaults(
vector_store=text_store, image_store=image_store
)
# Create the MultiModal index
documents = SimpleDirectoryReader("./data_wiki/").load_data()
index = MultiModalVectorStoreIndex.from_documents(
documents,
storage_context=storage_context,
)
/Users/haotianzhang/llama_index/venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
绘制从维基百科下载的图片¶
from PIL import Image
import matplotlib.pyplot as plt
import os
def plot_images(image_metadata_dict):
original_images_urls = []
images_shown = 0
for image_id in image_metadata_dict:
img_path = image_metadata_dict[image_id]["img_path"]
if os.path.isfile(img_path):
filename = image_metadata_dict[image_id]["filename"]
image = Image.open(img_path).convert("RGB")
plt.subplot(8, 8, len(original_images_urls) + 1)
plt.imshow(image)
plt.xticks([])
plt.yticks([])
original_images_urls.append(filename)
images_shown += 1
if images_shown >= 64:
break
plt.tight_layout()
plot_images(image_metadata_dict)
在独立集合 wikipedia_img 下构建单独的 CLIP 图像嵌入索引¶
def plot_images(image_paths):
images_shown = 0
plt.figure(figsize=(16, 9))
for img_path in image_paths:
if os.path.isfile(img_path):
image = Image.open(img_path)
plt.subplot(2, 3, images_shown + 1)
plt.imshow(image)
plt.xticks([])
plt.yticks([])
images_shown += 1
if images_shown >= 9:
break
获取示例查询的多模态检索结果¶
test_query = "who are BTS team members"
# generate retrieval results
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=5)
retrieval_results = retriever.retrieve(test_query)
from llama_index.core.response.notebook_utils import display_source_node
from llama_index.core.schema import ImageNode
retrieved_image = []
for res_node in retrieval_results:
if isinstance(res_node.node, ImageNode):
retrieved_image.append(res_node.node.metadata["file_path"])
else:
display_source_node(res_node, source_length=200)
plot_images(retrieved_image)
Node ID: e30e1817-4e31-4047-be5d-37502560920c
Similarity: 0.808149809808292
Text: BTS (Korean: 방탄소년단; RR: Bangtan Sonyeondan; lit. Bulletproof Boy Scouts), also known as the Bangtan Boys, is a South Korean boy band formed in 2010. The band consists of Jin, Suga, J-Hope, RM, Jimi...
Node ID: 024f3296-37c8-46d5-a184-2f78c621a99f
Similarity: 0.7987048642063129
Text: === Fandom ===
According to Kyung Hyun Kim, BTS's rise was facilitated by a great increase in music video programming and consumption on YouTube and the coming of an idol empire, including merchand...
Node ID: c564ccf4-a94f-408f-8b21-224538dc2e94
Similarity: 0.7838098925118134
Text: == History ==
=== 2010–2014: Formation and early years ===
BTS was formed in 2010, after Big Hit Entertainment CEO Bang Si-hyuk wanted to form a hip hop group around RM (Kim Nam-joon), an undergr...
test_query = "what are Vincent van Gogh's famous paintings"
# generate retrieval results
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=5)
retrieval_results = retriever.retrieve(test_query)
retrieved_image = []
for res_node in retrieval_results:
if isinstance(res_node.node, ImageNode):
retrieved_image.append(res_node.node.metadata["file_path"])
else:
display_source_node(res_node, source_length=200)
plot_images(retrieved_image)
Node ID: e002927c-0bf5-482b-a0a1-0ee2f3cd48f9
Similarity: 0.8675476190545354
Text: Vincent Willem van Gogh (Dutch: [ˈvɪnsɛnt ˈʋɪləɱ vɑŋ ˈɣɔx] ; 30 March 1853 – 29 July 1890) was a Dutch Post-Impressionist painter who is among the most famous and influential figures in the history...
Node ID: 69ef1c64-a5b4-468c-a58c-7d36151961a7
Similarity: 0.8661792475490765
Text: ==== Flowers ====
Van Gogh painted several landscapes with flowers, including roses, lilacs, irises, and sunflowers. Some reflect his interests in the language of colour, and also in Japanese ukiy...
Node ID: f971a611-a8b9-48b4-a81b-d3856438aab8
Similarity: 0.8616832203971132
Text: ==== Portraits ====
Van Gogh said portaiture was his greatest interest. "What I'm most passionate about, much much more than all the rest in my profession", he wrote in 1890, "is the portrait, the...
test_query = "what is the popular tourist attraction in San Francisco"
# generate retrieval results
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=5)
retrieval_results = retriever.retrieve(test_query)
retrieved_image = []
for res_node in retrieval_results:
if isinstance(res_node.node, ImageNode):
retrieved_image.append(res_node.node.metadata["file_path"])
else:
display_source_node(res_node, source_length=200)
plot_images(retrieved_image)
Node ID: 8c14be3e-345a-4764-9b64-dacff771bc04
Similarity: 0.8689195893277072
Text: === Tourism and conventions ===
Tourism is one of San Francisco's most important private-sector industries, accounting for more than one out of seven jobs in the city. The city's frequent portraya...
Node ID: 22aa7d86-017f-433d-98dc-4007d9f67c17
Similarity: 0.8452524742723133
Text: === LGBT ===
San Francisco has long had an LGBT-friendly history. It was home to the first lesbian-rights organization in the United States, Daughters of Bilitis; the first openly gay person to ru...
Node ID: 3846a17a-79d8-415e-9bcf-76c818b27203
Similarity: 0.8329496262980858
Text: == Parks and recreation ==
Several of San Francisco's parks and nearly all of its beaches form part of the regional Golden Gate National Recreation Area, one of the most visited units of the Natio...
test_query = "which company makes Tesla"
# generate retrieval results
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=5)
retrieval_results = retriever.retrieve(test_query)
retrieved_image = []
for res_node in retrieval_results:
if isinstance(res_node.node, ImageNode):
retrieved_image.append(res_node.node.metadata["file_path"])
else:
display_source_node(res_node, source_length=200)
plot_images(retrieved_image)
Node ID: 214c61be-dad6-403c-b301-bc2320b87e7a
Similarity: 0.7808396168295813
Text: The Tesla Model S is a battery electric full-size luxury sedan with a liftback body style built by Tesla, Inc. since 2012. The Model S features a battery-powered dual-motor, all-wheel drive layout,...
Node ID: 15b737b4-90e3-443a-87aa-13a7d7e80b87
Similarity: 0.7807424063856144
Text: === P100D ===
The P100D outputs 439 kW (589 hp) and 1,248 N⋅m (920 lbf⋅ft) torque on a dynamometer.As of March 2017, P100D was the world's quickest production vehicle with a NHRA rolling start to 6...
Node ID: e134452b-3031-47b0-a20c-df4fe32f1bcf
Similarity: 0.7754107325086438
Text: === Recalls ===
As of December 2021, Tesla had had seven Model S recalls:
On June 14, 2013, Tesla recalled Model S vehicles manufactured between May 10, 2013, and June 8, 2013, due to improper meth...
test_query = "what is the main character in Batman"
# generate retrieval results
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=5)
retrieval_results = retriever.retrieve(test_query)
retrieved_image = []
for res_node in retrieval_results:
if isinstance(res_node.node, ImageNode):
retrieved_image.append(res_node.node.metadata["file_path"])
else:
display_source_node(res_node, source_length=200)
plot_images(retrieved_image)
Node ID: ff85b136-08c8-465d-96f5-a554c65067d8
Similarity: 0.8461934674061043
Text: ==== Personality ====
Batman's primary character traits can be summarized as "wealth; physical prowess; deductive abilities and obsession". The details and tone of Batman comic books have varied ov...
Node ID: 55f5b842-6fd0-4e45-aef2-27f74f670e82
Similarity: 0.8229623965891602
Text: Batman is a superhero appearing in American comic books published by DC Comics. The character was created by artist Bob Kane and writer Bill Finger, and debuted in the 27th issue of the comic book ...
Node ID: cb4755db-088e-46af-92b2-3a4a3649d9fe
Similarity: 0.8218281955244808
Text: === Enemies ===
Batman faces a variety of foes ranging from common criminals to outlandish supervillains. Many of them mirror aspects of the Batman's character and development, often having tragic...